In [21]:
def create_download_link(df=pd.DataFrame, title = "Click to Download", filename=str, level='GP'):
    if level=='Village':
        csv = df.to_csv(index=True,index_label=['District','Janpad','Gram Panchayat','Village'])
    elif level=='GP':
        csv = df.to_csv(index=True,index_label=['District','Janpad','Gram Panchayat'])
    elif level=='Janpad':
        csv = df.to_csv(index=True,index_label=['District','Janpad'])
    elif level=='District': 
        csv = df.to_csv(index=True,index_label=['District'])
    else:
        csv = df.to_csv(index=True,index_label=['District','Janpad','Gram Panchayat'])
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    ###html = '<input type="button" value="Download" download="{filename}" onclick="data:text/csv;base64,{payload}" target="_blank">{title}</input>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

Section 1: Information about the file -

  1. Total records
Out[22]:
8831779
  1. Number of Columns (Data Columns)
Out[23]:
23
  1. Column Names and Data Type
Out[24]:
Index(['District', 'Janpad', 'Gram Panchayat', 'Village',
       'Name of head of household', 'GOI ID', 'Column_1', 'Economic status',
       'Sub-category', 'Has Toilet', 'Type of toilet', 'Has Water Source',
       'Water source', 'State', 'Village ID', 'GP ID', 'Janpad ID',
       '% Headed by Women', '% Headed by PWD', '% Having No Toilet',
       '% Having No Water Source', 'Total Household',
       '% Households having Toilet but no Water Source'],
      dtype='object')

Section 2: Snapshot of the data -

Districts:  51   |  Janpad:  313   |  Gram Panchayats:  22550   |  Villages:  50049   |  Surveys:  8831779

Does the data contain any empty cell?

True - Yes
False - No
In [26]:
raw_data.empty
Out[26]:
False

Identify duplicate records

Based on the combination - District+Janpad+Gram Panchayat+Village+Name of head of household+GOI ID

In [27]:
raw_data["is_duplicate_record"] = raw_data.duplicated(subset=("District",
             "Janpad",
             "Gram Panchayat",
             "Village",
             "Name of head of household",
             "GOI ID"),
                                                     keep='last')

Section 3: Data Error Checks

1. Total Duplicate records:

Out[28]:
65222

Deduplicating the data

Number of unique Records -

Out[29]:
8766557

2. Missing Data in Mandatory Columns:

-----
Series([], Name: Column_1, dtype: int64)
Missing Data:  8766557

Observations:

a. All the values in Column_1 are missing. b. There is no other missing value error.

3. Number of records where toilet is available in the household but type of toilet has either not been provided or mentioned as not available -

In [31]:
"""Data Error Condition 3: Number of Records where toilet is available but type of toilet has been mentioned उपलब्ध नहीं है"""
total_hh=np.sum(raw_data['Total Household'])
                
raw_data_tlt_nttype = raw_data[(raw_data["Has Toilet"] == 'हाँ') & (raw_data["Type of toilet"] == 'उपलब्ध नहीं है')]
x=raw_data_tlt_nttype.shape[0]
y=round(x/total_hh*100,2)
print("No of Records:",x," (",y,"%)")
No of Records: 8756  ( 0.1 %)

4. Number of records where toilet is not available in the household but type of toilet has been mentioned -

In [32]:
"""Data Error Condition 4: Number of Records where toilet is not available but type of toilet has been mentioned other than उपलब्ध नहीं है"""
raw_data_ntlt_ttype=raw_data[(raw_data["Has Toilet"] == 'नहीं') & (raw_data["Type of toilet"] != 'उपलब्ध नहीं है')]

x=raw_data_ntlt_ttype.shape[0]
y=round(x/total_hh*100,2)
print("No of Records:",x," (",y,"%)")
No of Records: 3262  ( 0.04 %)

5. Number of records where water source is available in the household but type of source has either not been provided or mentioned as not available -

In [33]:
"""Data Error Condition 3: Number of Records where water source is available but type of source has been mentioned as not available"""
raw_data_ws_nwstype=raw_data[(raw_data["Has Water Source"] == 'हाँ') & ((raw_data["Water source"] == 'उपलब्ध नहीं है') | (raw_data["Water source"] == 'पानी का कोई स्त्रोत नहीं'))]
x=raw_data_ws_nwstype.shape[0]
y=round(x/total_hh*100,2)
print("No of Records:",x," (",y,"%)")
No of Records: 321556  ( 3.67 %)

6. Number of records where water source is not available in the household but type of source has been provided -

In [34]:
"""Data Error Condition 3: Number of Records where water source is not available but type of source has been mentioned as available"""
raw_data_nws_wstype=raw_data[(raw_data["Has Water Source"] == 'नहीं') & (raw_data["Water source"] != 'उपलब्ध नहीं है') & (raw_data["Water source"] != 'पानी का कोई स्त्रोत नहीं')]

x=raw_data_nws_wstype.shape[0]
y=round(x/total_hh*100,2)
print("No of Records:",x," (",y,"%)")
No of Records: 792292  ( 9.04 %)

Section 2: Data Analysis

1. Overview

Total Households Surveyed in Verification:

Out[35]:
8766557

Households by Economic Status

Number of households by Economic Status
-----------------------
     Total Households      %
BPL           4448441  50.74
APL           4318116  49.26

Observations:

  1. Distribution of households by economic status is quite even, i.e. there are almost 50% BPL and 50% APL families where verification of toilet has been conducted.

Households by Social Category

Number of households in
-----------------------
ST                                  2185368
GENERAL                             2082756
Other                               1987984
SC                                  1310257
Small & Marginal Farmers             895338
Women Headed HH                      203653
Landless with homestead               74495
Headed by Person with Disability      26706
Name: Sub-category, dtype: int64
-----------------------
% of households in
-----------------------

Households by Economic Status and Social Category

Observations -

  1. There are 22.7% households in "Others" category
  2. All the General category households are BPL
  3. BPL Households has only General, SC and ST category. No landless family, or household headed by a woman or physically challenged person has been reported as BPL.

2. Status of Toilets in Households

Availability of Toilets in the Households

Number of households having toilet
-----------------------
Yes    8315353
No      451204
Name: Has Toilet, dtype: int64
-----------------------
Number of Household having toilet(or not) by Economic Status:
Has Toilet Yes No
Economic status
BPL 4205129 243312
APL 4110224 207892
Number of Household having toilet(or not) by Social Category:
Sub-category Small & Marginal Farmers Other GENERAL ST SC Landless with homestead Women Headed HH Headed by Person with Disability
Has Toilet
Yes 864974 1870337 1975332 2063184 1249649 72033 194271 25573
No 30364 117647 107424 122184 60608 2462 9382 1133

Observations:

  1. 94.85% of the housheolds have toilet
  2. However 4.55 lakh households are still devoid of toilet facility. ST households are the most affected

Availabilty of Different Types of Toilet across Households

In [47]:
print("Number of households having different types toilet")
print("--------------------------------------------------")
hh_hvng_tlt_type_cnt = hh_with_toilet["Type of toilet"].apply(lambda x: "Single Pit" if x=="एक गड्ढे वाला" else ("Double Pit" if x=="दो गड्ढे वाला" else ("Septic Tank" if x=="सेप्टिक टैंक" else ("Septic Tank with Soak Pit" if x=="सोखते गड्ढे के साथ सेप्टिक टैंक" else ("Bio Toilet" if x== "बायो-शौचालय" else "Not Available"))))).value_counts()
print(hh_hvng_tlt_type_cnt)
print("--------------------------------------------------")
Number of households having different types toilet
--------------------------------------------------
Single Pit                   3727702
Double Pit                   2636584
Septic Tank                  1586399
Septic Tank with Soak Pit     342754
Bio Toilet                     13158
Not Available                      0
Name: Type of toilet, dtype: int64
--------------------------------------------------
In [48]:
display_donut(hh_with_toilet["Type of toilet"].apply(lambda x: "Single Pit" if x=="एक गड्ढे वाला" else ("Double Pit" if x=="दो गड्ढे वाला" else ("Septic Tank" if x=="सेप्टिक टैंक" else ("Septic Tank with Soak Pit" if x=="सोखते गड्ढे के साथ सेप्टिक टैंक" else ("Bio Toilet" if x== "बायो-शौचालय" else "Not Available"))))),
              width=5,
              height=5,
              title="% of Households having different Types of Toilets",
              pct=True)

3. Status of Water Sources in the Households

Availability of Water Source in the Households

In [49]:
print("Number of households having water source")
print("-----------------------")
display(raw_data["Has Water Source"].apply(lambda x: "Yes" if x=="हाँ" else "No").value_counts())
Number of households having water source
-----------------------
Yes    7781886
No      984671
Name: Has Water Source, dtype: int64
In [50]:
display_donut(raw_data["Has Water Source"].apply(lambda x: "Yes" if x=="हाँ" else "No"),
              width=5,
              height=5,
              title="Availability of Water Source in Households",
              pct=True)

Availability of water source in households across different economic status

Has Water Source Yes No
Economic status
BPL 3875688 572753
APL 3906198 411918

Availability of water source in households across different social strata

Sub-category Small & Marginal Farmers Other GENERAL ST SC Landless with homestead Women Headed HH Headed by Person with Disability
Has Water Source
Yes 831712 1802052 1856359 1838507 1177699 68898 182543 24116
No 63626 185932 226397 346861 132558 5597 21110 2590

Observation:

  1. 88.77% of the households have a water source
  2. Howerver, around 9.9 Lakh Households don't have water source. ST category households are the worst affected.

Availabilty of Different Types of Water Sources across Households

Number of households having different types of water source
--------------------------------------------------
Handpump                  4125863
Piped Water Connection    2212615
Well                      1090772
Pond                        31080
No Source of Water              0
Data Missing                    0
Name: Water source, dtype: int64
--------------------------------------------------

Availability of Toilets and Water Source in the Households

Has Water Source      Yes      No
Has Toilet                       
Yes               7763026  552327
No                  18860  432344

Observation:

  1. Handpump is the major source of water in households (55% of households use handpumps)
  2. Water Supply scheme is available in only 29.7% ~ 30% of the households and the remaining.
  3. 14% of the households rely on wells
  4. Around 5.56 Lakh households having toilets has no water source

Section 3: Status of toilets across State and Districts

Out[58]:
% Having No Toilet % Having No Water Source % Headed by PWD % Headed by Women % Households having Toilet but no Water Source Total Household
District
AGAR MALWA 3.85 6.16 0.15 1.35 2.88 76628
ALIRAJPUR 0.96 3.4 0.15 2.12 2.61 107318
ANUPPUR 3.62 21.9 0.33 4.35 18.37 115085
ASHOKNAGAR 1.02 4.42 0.29 1.74 3.45 90075
BALAGHAT 6.25 12.6 0.15 1.85 6.67 272765
BARWANI 9.33 17.08 0.06 0.92 8 177163
BETUL 3.78 8.77 0.28 3.01 5.15 203891
BHIND 2.55 3.22 0.34 3.66 0.87 156182
BHOPAL 1.85 3.54 0.3 1.48 1.73 79034
BURHANPUR 4.77 10.26 0.07 1.13 5.58 112210
CHHATARPUR 9.48 19.99 0.49 3.23 10.99 188637
CHHINDWARA 1.79 5.07 0.45 3.16 3.34 318605
DAMOH 7.7 18.79 0.32 2.65 11.25 200271
DATIA 4.53 5.88 0.22 2.18 1.58 75760
DEWAS 4.21 6.22 0.21 2.1 2.25 199603
DHAR 0.98 2.63 0.17 2 1.71 289226
DINDORI 7.84 20.25 0.11 2.05 12.53 132646
GUNA 7.01 16.05 0.21 1.63 9.22 131315
GWALIOR 1.19 1.93 0.24 2.02 1 104985
HARDA 1.96 5.53 0.1 0.58 3.66 89328
HOSHANGABAD 2.43 5.95 0.24 1.36 3.65 159055
INDORE 0.33 0.71 0.1 1.06 0.4 152346
JABALPUR 23.24 33.74 0.25 2.15 11.12 182602
JHABUA 1.9 8.2 0.51 1.7 6.44 161301
KATNI 4.11 11.63 0.28 2.21 7.69 183285
KHANDWA 3.77 6.92 0.3 2.94 3.54 191797
KHARGONE 9.14 17.6 0.49 3.12 8.7 227105
MANDLA 8.52 24.42 0.35 3.31 16.17 201652
MANDSAUR 3 6.14 0.32 2.75 3.4 185292
MORENA 3.18 4.5 0.39 3.23 1.47 138933
NARSINGHPUR 1.23 3.65 0.35 1.35 2.49 183162
NEEMUCH 2.51 9.07 0.45 2.52 6.85 92330
PANNA 5.98 12.11 0.24 1.85 6.29 138814
RAISEN 3.01 5.63 0.37 1.97 3.08 194332
RAJGARH 7.36 14.53 0.38 2.41 7.58 170590
RATLAM 1.13 3.87 0.25 2.7 2.78 169367
REWA 9.65 18.9 0.26 2.66 9.54 346838
SAGAR 2.1 6.45 0.26 2.39 4.46 265029
SATNA 8.9 22.37 0.28 2.29 13.76 335441
SEHORE 1.25 4.22 0.32 1.16 3.05 202092
SEONI 15.12 27.76 0.57 3.28 13.07 226343
SHAHDOL 1.66 7.11 0.36 2.6 5.53 150804
SHAJAPUR 0.63 1.61 0.19 1.55 1.05 144327
SHEOPUR 1.02 1.55 0.31 2.66 0.85 95472
SHIVPURI 8.49 14.26 0.29 2.56 5.99 195824
SIDHI 4.54 13.29 0.67 3.49 8.93 159689
SINGROULI 4.61 19.38 0.3 2.17 14.87 134310
TIKAMGARH 1.54 3.49 0.63 2.57 2.01 174490
UJJAIN 2.05 3.19 0.19 1.92 1.19 208372
UMARIA 6.99 19.57 0.11 1.3 13.04 98484
VIDISHA 7.96 12.88 0.44 2.2 5.35 176352
All 5.15 11.23 0.3 2.32 6.3 8766557

Observations

  1. In Jabalpur 23.27% of the households do not have toilet, followed by Seoni with 15.1% such households
  2. Again, Jabalpur and Seoni are the two districts with highest % of households without any source of water (with 33.75% and 27.78% respectively)
  3. Anuppur and Mandala districts have the highest proportion of households having a toilet but no water source (with 18.37% and 16.2% respectively)

Section 4: List of Distribution of Households

Number of duplicate records region wise

In [59]:
dup_rec=raw_data_dup.pivot_table(index=["District","Janpad","Gram Panchayat"],
                         values='Total Household',
                        aggfunc=np.sum).sort_values('Total Household', ascending=False)

create_download_link(dup_rec,filename="Household Verification - Duplicate Surveys by region.csv")

Status of toilets across State and Districts

In [60]:
hh_tnu_tlt_status = raw_data.pivot_table(index=['District','Janpad','Gram Panchayat'], 
                     values=['Total Household', 
                             '% Headed by Women', 
                             '% Headed by PWD', 
                             '% Having No Toilet', 
                             '% Having No Water Source',
                            '% Households having Toilet but no Water Source'],
                     aggfunc={'Total Household':np.sum, 
                              '% Headed by Women':np.mean, 
                              '% Headed by PWD':np.mean, 
                              '% Having No Toilet':np.mean, 
                              '% Having No Water Source':np.mean,
                             '% Households having Toilet but no Water Source':np.mean},
                    margins=True).apply(lambda x: round(x,2))

create_download_link(hh_tnu_tlt_status, filename="Household Verification - Status of Toilets  in the region.csv")

Number of Households across different socio-economic categories

In [61]:
hh_by_scio_eco=raw_data.pivot_table(index=["District","Janpad", "Gram Panchayat"], 
                        columns=["Economic status", "Sub-category"],
                        values=['Total Household'],
                        aggfunc={'Total Household':np.sum}, 
                        margins=True)

create_download_link(hh_by_scio_eco,filename="Household Verification - Distribution of Households by Socio Economic Status.csv")

List of Households Not Having Toilet

In [62]:
hh_no_tlt=raw_data[raw_data['Has Toilet']!="हाँ"].pivot_table(index=["District",
                                                           "Janpad",
                                                           "Gram Panchayat",
                                                           "Village"],
                     values='Total Household',
                     aggfunc=np.sum).sort_values('Total Household', ascending=False)

create_download_link(hh_no_tlt,filename="Household Verification - Households not having toilet.csv",level="Village")

Number of records where toilet is available in the household but type of toilet has either not been provided or mentioned as not available

In [63]:
raw_data_tlt_nttype_pv=raw_data_tlt_nttype.pivot_table(index=["District","Janpad","Gram Panchayat"],
                         values='Total Household',
                        aggfunc=np.sum).sort_values('Total Household', ascending=False)

create_download_link(raw_data_tlt_nttype_pv,filename="Household Verification Error - Missing/Incorect Type of Toilet.csv")

Number of records where toilet is not available in the household but type of toilet has been mentioned

In [64]:
raw_data_ntlt_ttype_pv=raw_data_ntlt_ttype.pivot_table(index=["District","Janpad","Gram Panchayat"],
                         values='Total Household',
                        aggfunc=np.sum).sort_values('Total Household', ascending=False)

create_download_link(raw_data_ntlt_ttype_pv,filename="Household Verification Error - No Toilet but Type Mentioned.csv")

Number of records where water source is available in the household but type of source has either not been provided or mentioned as not available

In [65]:
raw_data_ws_nwstype_pv=raw_data_ws_nwstype.pivot_table(index=["District","Janpad","Gram Panchayat"],
                         values='Total Household',
                        aggfunc=np.sum).sort_values('Total Household', ascending=False)

create_download_link(raw_data_ws_nwstype_pv,filename="Household Verification Error - Water Source Missing.csv")

Number of records where water source is not available in the household but type of source has been provided

In [66]:
raw_data_nws_wstype_pv=raw_data_nws_wstype.pivot_table(index=["District","Janpad","Gram Panchayat"],
                         values='Total Household',
                        aggfunc=np.sum).sort_values('Total Household', ascending=False)

create_download_link(raw_data_nws_wstype_pv,filename="Household Verification Error - No Water but Source Provided.csv")